{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "eRrAUUu-t-O2"
      },
      "source": [
        "# Behaviour suite\n",
        "\n",
        "(github.com/deepmind/bsuite/)[https://github.com/deepmind/bsuite]\n",
        "\n",
        "This is the official results page for `bsuite`. You can use this to:\n",
        "- Get a snapshot of agent performance.\n",
        "- Diagnose strengths/weaknesses of your agent.\n",
        "- Leverage ready-made plots and analysis"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "qXOubWdlH9C0"
      },
      "outputs": [],
      "source": [
        "#@title Imports\n",
        "\n",
        "# pylint: disable=unused-import\n",
        "\n",
        "from __future__ import absolute_import\n",
        "from __future__ import division\n",
        "from __future__ import print_function\n",
        "\n",
        "import warnings\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import plotnine as gg\n",
        "\n",
        "# You can implement your own logging system, and use this to read results.\n",
        "# bsuite import section.\n",
        "# begin bsuite imports.\n",
        "from bsuite.logging import csv_load\n",
        "from bsuite.logging import sqlite_load\n",
        "# end bsuite imports.\n",
        "\n",
        "pd.options.mode.chained_assignment = None\n",
        "gg.theme_set(gg.theme_bw(base_size=16, base_family='serif'))\n",
        "gg.theme_update(figure_size=(12, 8), panel_spacing_x=0.5, panel_spacing_y=0.5)\n",
        "warnings.filterwarnings('ignore')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "gcpZiexEmjdf"
      },
      "source": [
        "##  Overall `bsuite` scores\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "dIJBQqCDp5aP"
      },
      "source": [
        "Load your experiments below. We recommend a maximum of 5 result sets, for clarity of analysis.\n",
        "\n",
        "The input to the `load_bsuite` function is a dict that maps from an experiment name of your choosing to the result path.\n",
        "\n",
        "For an experiment that used CSV logging, this would map to the directory containing the results. For SQLite logging, this would map to the database file for that experiment."
      ]
    },
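    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "As a rough illustration of the mapping described above, the sketch below builds an `experiments` dict and sanity-checks it before loading. The experiment names, the paths, and the `check_experiments` helper are all hypothetical and are not part of `bsuite`.\n",
        "\n",
        "```python\n",
        "import os\n",
        "\n",
        "MAX_RESULT_SETS = 5  # Recommended maximum from the text above.\n",
        "\n",
        "\n",
        "def check_experiments(experiments, max_sets=MAX_RESULT_SETS):\n",
        "  # Returns a list of human-readable problems found in the mapping.\n",
        "  problems = []\n",
        "  if len(experiments) \u003e max_sets:\n",
        "    problems.append('{} result sets given; at most {} recommended.'.format(\n",
        "        len(experiments), max_sets))\n",
        "  for name, path in experiments.items():\n",
        "    if not os.path.exists(path):\n",
        "      problems.append('{}: path {} does not exist.'.format(name, path))\n",
        "  return problems\n",
        "\n",
        "\n",
        "# Hypothetical names and paths for CSV logging: each value is a results\n",
        "# directory. Replace them with your own.\n",
        "experiments = {\n",
        "    'dqn': '/tmp/bsuite/dqn',\n",
        "    'a2c': '/tmp/bsuite/a2c',\n",
        "}\n",
        "for problem in check_experiments(experiments):\n",
        "  print(problem)\n",
        "```\n",
        "\n",
        "An empty list means every path exists and the number of result sets is within the recommended limit. For SQLite logging the values would be database files rather than directories."
      ]
    },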
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "TnqNuenpr61Y"
      },
      "outputs": [],
      "source": [
        "#@title loading results from local data:\n",
        "\n",
        "experiments = {}  # Add results here\n",
        "DF, SWEEP_VARS = sqlite_load.load_bsuite(experiments)\n",
        "# Or\n",
        "# DF, SWEEP_VARS = csv_load.load_bsuite(experiments)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "plQLUbWPpUhv"
      },
      "outputs": [],
      "source": [
        "#@title overall score as radar plot (double-click to show/hide code)\n",
        "BSUITE_SCORE = summary_analysis.bsuite_score(DF, SWEEP_VARS)\n",
        "BSUITE_SUMMARY = summary_analysis.ave_score_by_tag(BSUITE_SCORE, SWEEP_VARS)\n",
        "__radar_fig__ = summary_analysis.bsuite_radar_plot(BSUITE_SUMMARY, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "5ANabuXdFAFS"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Snapshot of agent behaviour across key metrics as measured by bsuite.\n",
        "- Length of each \"spoke\" represents score between 0 and 1.\n",
        "- For more detailed analysis, click into specific challenge domains."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ds789Mrq5LmR"
      },
      "source": [
        "### Plotting scores per challenge in bar plot (click to show)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "8VGZIvfGtZ4m"
      },
      "outputs": [],
      "source": [
        "#@title plotting overall score as bar (double-click to show/hide code)\n",
        "summary_analysis.bsuite_bar_plot(BSUITE_SCORE, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "QsUzPzmrG208"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Height of each bar is the score on each challenge domain.\n",
        "- Partially-finished runs are shown with transparent bars.\n",
        "- Parameter/agent sweeps are automatically [faceted](http://www.sthda.com/english/wiki/ggplot2-facet-split-a-plot-into-a-matrix-of-panels) side by side.\n",
        "- For more detailed analysis, click into specific challenge domains."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "iR-J9iZWPDNW"
      },
      "outputs": [],
      "source": [
        "#@title compare agent performance on each challenge (double-click to show/hide code)\n",
        "summary_analysis.bsuite_bar_plot_compare(BSUITE_SCORE, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "nJcxdHc9Ps7k"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Height of each bar is the score on each challenge domain.\n",
        "- Partially-finished runs are shown with transparent bars.\n",
        "- Each \"facet\" focuses on a separate environment.\n",
        "- This plot allows for easier comparison between agents.\n",
        "- For more detailed analysis, click into specific challenge domains."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "iaa0FgSoMu3T"
      },
      "source": [
        "# Individual challenge domains\n",
        "\n",
        "This section of the report contains specific analysis for each individual `bsuite` experiment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "x0Kh71xBGovc"
      },
      "outputs": [],
      "source": [
        "#@title Import experiment-specific analysis\n",
        "# bsuite import section.\n",
        "# begin bsuite imports.\n",
        "from bsuite.experiments.bandit import analysis as bandit_analysis\n",
        "from bsuite.experiments.bandit_noise import analysis as bandit_noise_analysis\n",
        "from bsuite.experiments.bandit_scale import analysis as bandit_scale_analysis\n",
        "from bsuite.experiments.cartpole import analysis as cartpole_analysis\n",
        "from bsuite.experiments.cartpole_noise import analysis as cartpole_noise_analysis\n",
        "from bsuite.experiments.cartpole_scale import analysis as cartpole_scale_analysis\n",
        "from bsuite.experiments.cartpole_swingup import analysis as cartpole_swingup_analysis\n",
        "from bsuite.experiments.catch import analysis as catch_analysis\n",
        "from bsuite.experiments.catch_noise import analysis as catch_noise_analysis\n",
        "from bsuite.experiments.catch_scale import analysis as catch_scale_analysis\n",
        "from bsuite.experiments.deep_sea import analysis as deep_sea_analysis\n",
        "from bsuite.experiments.deep_sea_stochastic import analysis as deep_sea_stochastic_analysis\n",
        "from bsuite.experiments.discounting_chain import analysis as discounting_chain_analysis\n",
        "from bsuite.experiments.memory_len import analysis as memory_len_analysis\n",
        "from bsuite.experiments.memory_size import analysis as memory_size_analysis\n",
        "from bsuite.experiments.mnist import analysis as mnist_analysis\n",
        "from bsuite.experiments.mnist_noise import analysis as mnist_noise_analysis\n",
        "from bsuite.experiments.mnist_scale import analysis as mnist_scale_analysis\n",
        "from bsuite.experiments.mountain_car import analysis as mountain_car_analysis\n",
        "from bsuite.experiments.mountain_car_noise import analysis as mountain_car_noise_analysis\n",
        "from bsuite.experiments.mountain_car_scale import analysis as mountain_car_scale_analysis\n",
        "from bsuite.experiments.umbrella_distract import analysis as umbrella_distract_analysis\n",
        "from bsuite.experiments.umbrella_length import analysis as umbrella_length_analysis\n",
        "# end bsuite imports."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "dwIcX62dDnNE"
      },
      "source": [
        "## Basic\n",
        "\n",
        "\n",
        "We begin with a collection of very simple decision problems with standard analysis:\n",
        "- Does the agent learn a reasonable rewarding policy?\n",
        "- How quickly do they learn simple tasks?\n",
        "\n",
        "We call these experiments \"basic\", since they are not particularly targeted at specific core issues.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "vQmNzVbBDqZa"
      },
      "source": [
        "### Bandit\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fjweihnXblOP"
      },
      "source": [
        "\n",
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/bandit.png\" alt=\"bandit diagram\" height=\"300\"/\u003e\n",
        "\n",
        "\n",
        "A simple independent-armed bandit problem.\n",
        "\n",
        "- The agent is faced with 11 actions with deterministic rewards [0.0, 0.1, .., 1.0] randomly assigned.\n",
        "- Run over 20 seeds for 10k episodes.\n",
        "- Score is 1 - 2 * average_regret at 10k episodes.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n",
        "\n",
        "\n",
        "\n"
      ]
    },
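    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The score formula above can be sketched in a few lines. This assumes that `average_regret` means `total_regret` divided by the number of episodes; the `bandit_score` helper is illustrative and is not the `bsuite` implementation.\n",
        "\n",
        "```python\n",
        "def bandit_score(total_regret, episode):\n",
        "  # Score = 1 - 2 * average_regret, with regret averaged per episode.\n",
        "  average_regret = total_regret / float(episode)\n",
        "  return 1.0 - 2.0 * average_regret\n",
        "\n",
        "\n",
        "# With rewards 0.0, 0.1, ..., 1.0 a uniformly random policy earns 0.5 per\n",
        "# episode on average, so its regret against the best arm (1.0) is 0.5.\n",
        "random_total_regret = 0.5 * 10000\n",
        "print(bandit_score(random_total_regret, 10000))  # Random policy.\n",
        "print(bandit_score(0.0, 10000))  # Policy that always picks the best arm.\n",
        "```\n",
        "\n",
        "Under this reading a random policy scores 0 and an agent with zero regret scores 1."
      ]
    },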
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "h5um7tOPDpju"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "bandit_df = DF[DF.bsuite_env == 'bandit'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'bandit', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "j583NAoLD5nD"
      },
      "outputs": [],
      "source": [
        "#@title plot average regret through learning (lower is better)\n",
        "bandit_analysis.plot_learning(bandit_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "asfed9wuEbO7"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of the agent averaged over 20 seeds.\n",
        "- Random policy has reward of 0  = regret of 0.5 = dashed line\n",
        "- Want to see a stable learning curve -\u003e 0 and fast!\n",
        "- Smoothing is performed with rolling mean over 10% of data with confidence bar at 95% Gaussian standard error.\n"
      ]
    },
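    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "One plausible reading of the smoothing described above is sketched below with pandas on synthetic data. The column names (`episode`, `seed`, `average_regret`), the `smooth_with_ci` helper, and the choice to smooth the standard error with the same window are assumptions; the plotting code in `bsuite` may differ.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "\n",
        "def smooth_with_ci(df, value='average_regret', frac=0.1, z=1.96):\n",
        "  # Mean and standard error across seeds at each episode.\n",
        "  grouped = df.groupby('episode')[value]\n",
        "  stats = pd.DataFrame({'mean': grouped.mean(), 'sem': grouped.sem()})\n",
        "  # Rolling mean over a window covering `frac` of the logged points.\n",
        "  window = max(1, int(frac * len(stats)))\n",
        "  smoothed = stats.rolling(window, min_periods=1).mean()\n",
        "  # 95% Gaussian interval: mean +/- 1.96 standard errors.\n",
        "  smoothed['lower'] = smoothed['mean'] - z * smoothed['sem']\n",
        "  smoothed['upper'] = smoothed['mean'] + z * smoothed['sem']\n",
        "  return smoothed.reset_index()\n",
        "\n",
        "\n",
        "# Synthetic regret curves for 20 seeds, for illustration only.\n",
        "rng = np.random.RandomState(0)\n",
        "episodes = np.arange(1, 101)\n",
        "frames = []\n",
        "for seed in range(20):\n",
        "  noise = rng.normal(scale=0.05, size=len(episodes))\n",
        "  frames.append(pd.DataFrame({\n",
        "      'episode': episodes,\n",
        "      'seed': seed,\n",
        "      'average_regret': 0.5 / np.sqrt(episodes) + noise,\n",
        "  }))\n",
        "curve = smooth_with_ci(pd.concat(frames, ignore_index=True))\n",
        "print(curve.tail())\n",
        "```\n",
        "\n",
        "The `lower` and `upper` columns give the band that would be drawn around the smoothed mean."
      ]
    },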
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "t600GCEj1qCu"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "bandit_analysis.plot_seeds(bandit_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_ypLP6DZHZc8"
      },
      "source": [
        "### MNIST\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "woT2ar_fbjy3"
      },
      "source": [
        "\n",
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/mnist.png\" alt=\"mnist diagram\" height=\"300\"/\u003e\n",
        "\n",
        "The \"hello world\" of deep learning, now as a contextual bandit.\n",
        "\n",
        "- Every timestep the agent must classify a random MNIST digit.\n",
        "- Reward +1 for correct, -1 for incorrect.\n",
        "- Run for 10k episodes, 20 seeds.\n",
        "- Score is percentage of successful classifications.\n",
        "- Must log `episode`, `total_regret` for standard analysis."
      ]
    },
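    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The reward scheme above fixes the per-step regret of a uniformly random classifier, which is a useful baseline when reading the learning curves. The arithmetic below is a small sketch that assumes ten equally likely digit classes and a perfect classifier as the reference policy.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "NUM_CLASSES = 10  # Assumption: ten equally likely digit classes.\n",
        "\n",
        "p_correct = 1.0 / NUM_CLASSES\n",
        "# Reward is +1 when correct and -1 when incorrect.\n",
        "random_reward = p_correct * 1.0 + (1.0 - p_correct) * -1.0\n",
        "optimal_reward = 1.0  # A perfect classifier is always correct.\n",
        "random_regret = optimal_reward - random_reward\n",
        "\n",
        "print(round(random_reward, 3), round(random_regret, 3))\n",
        "assert math.isclose(random_regret, 1.8)\n",
        "```\n",
        "\n",
        "A learning curve that stays near 1.8 regret per step is therefore no better than guessing."
      ]
    },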
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "77KttSBsHZc-"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "mnist_df = DF[DF.bsuite_env == 'mnist'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'mnist', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "dxjKHqPaHZdB"
      },
      "outputs": [],
      "source": [
        "#@title plot average regret through learning (lower is better)\n",
        "mnist_analysis.plot_learning(mnist_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "gMAtScV4HZdH"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of the agent averaged over 20 seeds.\n",
        "- Random policy has reward of 0  = regret of 1.8 = dashed line\n",
        "- Want to see a stable learning curve -\u003e 0 and fast!\n",
        "- Smoothing is performed with rolling mean over 10% of data with confidence bar at 95% Gaussian standard error.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "zmCi_x8F4jCt"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "mnist_analysis.plot_seeds(mnist_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "QWtMyhFpNYC9"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "GrTjfY11MD5E"
      },
      "source": [
        "### Catch"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "7MOVEQunM9QB"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/catch.png\" alt=\"catch diagram\" height=\"300\"/\u003e\n",
        "\n",
        "\n",
        "DeepMind's internal \"hello world\" for RL agents.\n",
        "\n",
        "- The environment is a 5x10 grid with a single falling block per episodes (similar to Tetris).\n",
        "- The agent controls a single \"paddle\" pixel that it should use to \"catch\" the falling block.\n",
        "- If the agent catches the block reward +1, if the agent misses the block reward -1.\n",
        "-   Run the agent for 10k episodes and 20 seeds.\n",
        "-   Score is percentage of successful \"catch\" over first 10k episodes.\n",
        "-   Must log `episode`, `total_regret` for standard analysis.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "54UwF1sONICb"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "catch_df = DF[DF.bsuite_env == 'catch'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'catch', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "54hwhFHSreCS"
      },
      "outputs": [],
      "source": [
        "#@title plot average regret through learning (lower is better)\n",
        "catch_analysis.plot_learning(catch_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "kjnlywacKyWk"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of the agent averaged over 20 seeds.\n",
        "- Random policy has reward of 0  = regret of 1.6 = dashed line\n",
        "- Want to see a stable learning curve -\u003e 0 and fast!\n",
        "- Smoothing is performed with rolling mean over 10% of data with confidence bar at 95% Gaussian standard error.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "m-F2cCPb4kNa"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "catch_analysis.plot_seeds(catch_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "sVZ3SfqYNY5H"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "YtCu7IUwFYOY"
      },
      "source": [
        "### Mountain car"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "twcWpU2hb4XT"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/mountain_car.png\" alt=\"mountaincar diagram\" height=\"300\"/\u003e\n",
        "\n",
        "A classic benchmark problem in RL.\n",
        "The agent controls an underpowered car and must drive it out of a valley.\n",
        "\n",
        "- Reward of -1 each step until the car reaches the goal.\n",
        "- Maximum episode length of 1000 steps.\n",
        "- Run 1000 episodes for 20 seeds.\n",
        "- Score is based on regret against \"good\" policy that solves in 25 steps.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "10AxDzgmFYOa"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "mountain_car_df = DF[DF.bsuite_env == 'mountain_car'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'mountain_car', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "PCiai_7_FYOe"
      },
      "outputs": [],
      "source": [
        "#@title plot average regret through learning (lower is better)\n",
        "mountain_car_analysis.plot_learning(mountain_car_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "mLE9dhuPclv5"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of the agent averaged over 20 seeds.\n",
        "- Dashed line is at 415 = average regret of a random agent.\n",
        "- Want to see a stable learning curve -\u003e 0 and fast!\n",
        "- Smoothing is performed with rolling mean over 10% of data with confidence bar at 95% Gaussian standard error."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "UotK4LDa4m62"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "mountain_car_analysis.plot_seeds(mountain_car_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ZpvpbLikNRRG"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "iKRx2R7DEz5R"
      },
      "source": [
        "### Cartpole\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "XJkGob4ebrRj"
      },
      "source": [
        "\n",
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/cartpole.png\" alt=\"cartpole diagram\" height=\"300\"/\u003e\n",
        "\n",
        "A classic benchmark problem in RL.\n",
        "The agent controls a cart on a frictionless plane.\n",
        "\n",
        "- The poles starts near-to upright.\n",
        "- The observation is [x, x_dot, sin(theta), sin(theta)_dot, cos(theta), cos(theta)_dot, time_elapsed]\n",
        "- Episodes end once 1000 steps have occured, or |x| is greater than 1.\n",
        "- Reward of +1 when pole \u003e 0.8 height.\n",
        "- Run 1000 episodes for 20 seeds.\n",
        "- Score is percentage of timesteps balancing the pole.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "1UFLKInrEz5X"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "cartpole_df = DF[DF.bsuite_env == 'cartpole'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'cartpole', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "CeR8Vgf-Ez5b"
      },
      "outputs": [],
      "source": [
        "#@title plot average regret through learning (lower is better)\n",
        "cartpole_analysis.plot_learning(cartpole_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "UO1GwYM5ZiSI"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of the agent averaged over 20 seeds.\n",
        "- Maximum regret of 1000 per episode = dashed line\n",
        "- Want to see a stable learning curve -\u003e 0 and fast!\n",
        "- Smoothing is performed with rolling mean over 10% of data with confidence bar at 95% Gaussian standard error."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "0vWWVcYR4lNZ"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "cartpole_analysis.plot_seeds(cartpole_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "jZuGijZZNBNU"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "UQ010l9tFsbG"
      },
      "source": [
        "## Reward noise\n",
        "\n",
        "To investigate the robustness of RL agents to noisy rewards, we repeat the \"basic\" experiments under differing levels of Gaussian noise.\n",
        "\n",
        "This time we allocate the 20 different seeds across 5 levels of Gaussian noise $N(0, \\sigma^2)$ for $\\sigma$ = noise\\_scale = $[0.1, 0.3, 1, 3, 10]$ with 4 seeds each."
      ]
    },
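    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The allocation described above can be sketched as follows. How `bsuite` numbers its seeds and injects noise internally is not stated here, so the `settings` layout and the `noisy_reward` helper are illustrative assumptions.\n",
        "\n",
        "```python\n",
        "import itertools\n",
        "\n",
        "import numpy as np\n",
        "\n",
        "NOISE_SCALES = [0.1, 0.3, 1.0, 3.0, 10.0]\n",
        "SEEDS_PER_SCALE = 4\n",
        "\n",
        "# 5 noise levels x 4 seeds = 20 runs in total.\n",
        "settings = [\n",
        "    {'noise_scale': scale, 'seed': seed}\n",
        "    for scale, seed in itertools.product(NOISE_SCALES, range(SEEDS_PER_SCALE))\n",
        "]\n",
        "\n",
        "\n",
        "def noisy_reward(reward, noise_scale, rng):\n",
        "  # Adds zero-mean Gaussian noise with standard deviation `noise_scale`.\n",
        "  return reward + noise_scale * rng.normal()\n",
        "\n",
        "\n",
        "rng = np.random.RandomState(0)\n",
        "print(len(settings))\n",
        "print(noisy_reward(1.0, 0.1, rng))\n",
        "```\n",
        "\n",
        "With `noise_scale` = 10 the noise is ten times the size of a unit reward, so only agents that average over many samples can be expected to beat the random baseline there."
      ]
    },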
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "SWm2u8lpFsbK"
      },
      "source": [
        "### Bandit noise"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "o27LKuR0d-Bh"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/bandit.png\" alt=\"bandit diagram\" height=\"300\"/\u003e\n",
        "\n",
        "\n",
        "A simple independent-armed bandit problem.\n",
        "\n",
        "- The agent is faced with 11 actions with deterministic rewards [0.0, 0.1, .., 1.0] randomly assigned.\n",
        "- Run noise_scale = [0.1, 0.3, 1., 3, 10] for 4 seeds for 10k episodes.\n",
        "- Score is 1 - 2 * average_regret at 10k episodes.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "NAU9QFGGFsbL"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "bandit_noise_df = DF[DF.bsuite_env == 'bandit_noise'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'bandit_noise', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "cKrEMGjlFsbP"
      },
      "outputs": [],
      "source": [
        "#@title average regret over learning (lower is better)\n",
        "bandit_noise_analysis.plot_average(bandit_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "szaUeRc4ed6Q"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 10k episodes by noise_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agents.\n",
        "- Look for largest noise_scale with performance significantly better than random agent."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "6fS80_PNF96e"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "bandit_noise_analysis.plot_learning(bandit_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "DWHHIRkhejPZ"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 10k episodes by noise_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent baseline.\n",
        "- Look for largest noise_scale with performance significantly better than baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "9a89RWjd4n4I"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "bandit_noise_analysis.plot_seeds(bandit_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "bvbUja5cNbA_"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "XeeO3UdkHvro"
      },
      "source": [
        "### MNIST noise"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "KQU-bBpCeMXS"
      },
      "source": [
        "\n",
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/mnist.png\" alt=\"mnist diagram\" height=\"300\"/\u003e\n",
        "\n",
        "The \"hello world\" of deep learning, now as a contextual bandit.\n",
        "\n",
        "- Every timestep the agent must classify a random MNIST digit.\n",
        "- Reward +1 for correct, -1 for incorrect.\n",
        "- Run noise_scale = [0.1, 0.3, 1., 3, 10] for 4 seeds for 10k episodes.\n",
        "- Score is percentage of successful classifications.\n",
        "- Must log `episode`, `total_regret` for standard analysis."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "3gHxu0e4Hvrp"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "mnist_noise_df = DF[DF.bsuite_env == 'mnist_noise'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'mnist_noise', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "HsNrBsl3Hvrx"
      },
      "outputs": [],
      "source": [
        "#@title average regret over learning (lower is better)\n",
        "mnist_noise_analysis.plot_average(mnist_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "m_q0mBBvfgq6"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 10k episodes by noise_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent.\n",
        "- Look for largest noise_scale with performance significantly better than random agent."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "6vKxHHfGHvr4"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "mnist_noise_analysis.plot_learning(mnist_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "gFquOp8YfenD"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret through learning by noise_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent baseline.\n",
        "- Look for largest noise_scale with performance significantly better than baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "H_wJ_oLq4o5K"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "mnist_noise_analysis.plot_seeds(mnist_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Ft77N63zNcKf"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "BhNvrDHtFsbW"
      },
      "source": [
        "### Catch noise"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "cuwO8ePyfnMF"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/catch.png\" alt=\"catch diagram\" height=\"300\"/\u003e\n",
        "\n",
        "\n",
        "DeepMind's internal \"hello world\" for RL agents.\n",
        "\n",
        "- The environment is a 5x10 grid with a single falling block per episode (similar to Tetris).\n",
        "- The agent controls a single \"paddle\" pixel that it should use to \"catch\" the falling block.\n",
        "- If the agent catches the block the reward is +1; if it misses the block the reward is -1.\n",
        "- Run noise_scale = [0.1, 0.3, 1., 3, 10] for 4 seeds for 10k episodes.\n",
        "- Score is the percentage of successful catches over the first 10k episodes.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "PfrF58GRFsbX"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "catch_noise_df = DF[DF.bsuite_env == 'catch_noise'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'catch_noise', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "QIHLMrZBFsba"
      },
      "outputs": [],
      "source": [
        "#@title average regret over learning (lower is better)\n",
        "catch_noise_analysis.plot_average(catch_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "KQne2I0TgIQN"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 10k episodes by noise_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent.\n",
        "- Look for largest noise_scale with performance significantly better than random agent."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "ChxlGZ3MGf9n"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "catch_noise_analysis.plot_learning(catch_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ZTYufA6_gLA6"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret through learning by noise_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent baseline.\n",
        "- Look for largest noise_scale with performance significantly better than baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "_zT9y1NA4poe"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "catch_noise_analysis.plot_seeds(catch_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "wQ15ZnVgNc6n"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "PvkWAhKAFsbo"
      },
      "source": [
        "### Mountain car noise"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "U3I25hMAf-5i"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/mountain_car.png\" alt=\"mountaincar diagram\" height=\"300\"/\u003e\n",
        "\n",
        "A classic benchmark problem in RL.\n",
        "The agent controls an underpowered car and must drive it out of a valley.\n",
        "\n",
        "- Reward of -1 each step until the car reaches the goal.\n",
        "- Maximum episode length of 1000 steps.\n",
        "- Run noise_scale = [0.1, 0.3, 1., 3, 10] for 4 seeds for 1k episodes.\n",
        "- Score is based on regret against \"good\" policy that solves in 25 steps.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "_5AaZZRhFsbo"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "mountain_car_noise_df = DF[DF.bsuite_env == 'mountain_car_noise'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'mountain_car_noise', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "LCBxH1IJFsbu"
      },
      "outputs": [],
      "source": [
        "#@title average regret over learning (lower is better)\n",
        "mountain_car_noise_analysis.plot_average(mountain_car_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "l2f9gSdNgOay"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 1k episodes by noise_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent.\n",
        "- Look for largest noise_scale with performance significantly better than random agent."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "DDXxo_9vH1u9"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "mountain_car_noise_analysis.plot_learning(mountain_car_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "cm_g7R2cgM7D"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret through learning by noise_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent baseline.\n",
        "- Look for largest noise_scale with performance significantly better than baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "uWCJrwHK4tKb"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "mountain_car_noise_analysis.plot_seeds(mountain_car_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "atMZdJBQNfNy"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_OXjiYVTFsbe"
      },
      "source": [
        "### Cartpole noise"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "zGPI8tqyf7E-"
      },
      "source": [
        "\n",
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/cartpole.png\" alt=\"cartpole diagram\" height=\"300\"/\u003e\n",
        "\n",
        "A classic benchmark problem in RL.\n",
        "The agent controls a cart on a frictionless plane.\n",
        "\n",
        "- The pole starts near upright.\n",
        "- The observation is [x, x_dot, sin(theta), sin(theta)_dot, cos(theta), cos(theta)_dot, time_elapsed]\n",
        "- Episodes end once 1000 steps have occurred, or |x| is greater than 1.\n",
        "- Reward of +1 when pole \u003e 0.8 height.\n",
        "- Run noise_scale = [0.1, 0.3, 1., 3, 10] for 4 seeds for 1k episodes.\n",
        "- Score is percentage of timesteps balancing the pole.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "wJpb89yUFsbf"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "cartpole_noise_df = DF[DF.bsuite_env == 'cartpole_noise'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'cartpole_noise', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "asG7gr5_Fsbi"
      },
      "outputs": [],
      "source": [
        "#@title average regret over learning (lower is better)\n",
        "cartpole_noise_analysis.plot_average(cartpole_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "wPOUkkq1gJBS"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 1k episodes by noise_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent.\n",
        "- Look for largest noise_scale with performance significantly better than random agent."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "kZrjk8WSHqUm"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "cartpole_noise_analysis.plot_learning(cartpole_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "vRa8iMYJgMCx"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret through learning by noise_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent baseline.\n",
        "- Look for largest noise_scale with performance significantly better than baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "vBEKq2zq4qh5"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "cartpole_noise_analysis.plot_seeds(cartpole_noise_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ZXAVj3SrNd6P"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "zCNVq9M0IEpT"
      },
      "source": [
        "## Reward scale\n",
        "\n",
        "To investigate the robustness of RL agents to reward scale, we repeat the \"basic\" experiments under differing levels of problem rescaling.\n",
        "\n",
        "This time we allocate the 20 different seeds across 5 levels of reward\\_scale = $[0.01, 0.1, 1, 10, 100]$ with 4 seeds each.\n",
        "\n",
        "In order to keep comparable statistics/regret we report rescaled regret/reward\\_scale."
      ]
    },
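    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The rescaling described above can be sketched as follows. This is a minimal illustration, not the `bsuite` implementation: it assumes a results frame with `total_regret` and `reward_scale` columns, and the helper `rescale_regret` and the example values are hypothetical.\n",
        "\n",
        "```python\n",
        "import pandas as pd\n",
        "\n",
        "\n",
        "def rescale_regret(df):\n",
        "  # Divide regret by reward_scale so runs at different scales are comparable.\n",
        "  out = df.copy()\n",
        "  out['rescaled_regret'] = out['total_regret'] / out['reward_scale']\n",
        "  return out\n",
        "\n",
        "\n",
        "# Made-up values: the same underlying regret observed at three reward scales.\n",
        "example_df = pd.DataFrame({\n",
        "    'reward_scale': [0.01, 1.0, 100.0],\n",
        "    'total_regret': [0.05, 5.0, 500.0],\n",
        "})\n",
        "print(rescale_regret(example_df))\n",
        "```\n",
        "\n",
        "Each row of the made-up frame maps back to roughly the same rescaled regret, which is what lets the plots in this section be compared across scales."
      ]
    },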
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "U5B77UDjIEpY"
      },
      "source": [
        "### Bandit scale"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "JscWOhKOiA60"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/bandit.png\" alt=\"bandit diagram\" height=\"300\"/\u003e\n",
        "\n",
        "\n",
        "A simple independent-armed bandit problem.\n",
        "\n",
        "- The agent is faced with 11 actions with deterministic rewards [0.0, 0.1, .., 1.0] randomly assigned.\n",
        "- Run reward_scale = [0.01, 0.1, 1., 10, 100] for 4 seeds for 10k episodes.\n",
        "- Score is 1 - 2 * average_regret at 10k episodes.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n"
      ]
    },
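    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The score formula above can be sketched as follows. This is a minimal illustration rather than the `bsuite` implementation: the helper `bandit_score` and the example values are hypothetical, and it assumes `total_regret` is cumulative, so that average regret is `total_regret / episode`. With unscaled rewards [0.0, 0.1, ..., 1.0], a uniformly random agent has expected average regret 0.5 and would therefore score about 0.\n",
        "\n",
        "```python\n",
        "import pandas as pd\n",
        "\n",
        "\n",
        "def bandit_score(df, num_episodes=10000):\n",
        "  # Average regret per episode at the final episode, then 1 - 2 * that value.\n",
        "  final = df[df['episode'] == num_episodes]\n",
        "  average_regret = (final['total_regret'] / final['episode']).mean()\n",
        "  return 1 - 2 * average_regret\n",
        "\n",
        "\n",
        "# Made-up logs: two runs, both read at episode 10k.\n",
        "example_df = pd.DataFrame({\n",
        "    'episode': [10000, 10000],\n",
        "    'total_regret': [500.0, 1500.0],\n",
        "})\n",
        "print(bandit_score(example_df))\n",
        "```\n",
        "\n",
        "Averaging over runs before applying the formula is a choice made for this sketch; the aggregation used by the standard analysis may differ."
      ]
    },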
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "XgOuCckcIEpb"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "bandit_scale_df = DF[DF.bsuite_env == 'bandit_scale'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'bandit_scale', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "3vR6CvD8IEpd"
      },
      "outputs": [],
      "source": [
        "#@title average regret over learning (lower is better)\n",
        "bandit_scale_analysis.plot_average(bandit_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "dNHc7dECukLF"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 10k episodes by reward_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent.\n",
        "- Look for reward_scale with performance significantly better than random agent."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "_qkiiY16IEpi"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "bandit_scale_analysis.plot_learning(bandit_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "4gd87xYWuvLa"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret through learning by reward_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent baseline.\n",
        "- Look for reward_scale with performance significantly better than baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "hUTzD5tj4vZq"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "bandit_scale_analysis.plot_seeds(bandit_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "22rddkYVNf8f"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hPlIUnPgIBb5"
      },
      "source": [
        "### MNIST scale"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Xo6uxj7iiGSL"
      },
      "source": [
        "\n",
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/mnist.png\" alt=\"mnist diagram\" height=\"300\"/\u003e\n",
        "\n",
        "The \"hello world\" of deep learning, now as a contextual bandit.\n",
        "\n",
        "- Every timestep the agent must classify a random MNIST digit.\n",
        "- Reward +1 for correct, -1 for incorrect.\n",
        "- Run reward_scale = [0.01, 0.1, 1., 10, 100] for 4 seeds for 10k episodes.\n",
        "- Score is percentage of successful classifications.\n",
        "- Must log `episode`, `total_regret` for standard analysis."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "KUp30dhSIBb6"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "mnist_scale_df = DF[DF.bsuite_env == 'mnist_scale'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'mnist_scale', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "opeE-AlYIBb8"
      },
      "outputs": [],
      "source": [
        "#@title average regret over learning (lower is better)\n",
        "mnist_scale_analysis.plot_average(mnist_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "vpeAlsxluomy"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 10k episodes by reward_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent.\n",
        "- Look for reward_scale with performance significantly better than random agent."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "QoJe7269IBcA"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "mnist_scale_analysis.plot_learning(mnist_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "nTCeZEkTuy9q"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret through learning by reward_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent baseline.\n",
        "- Look for reward_scale with performance significantly better than baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "G-KfqyMQ4wEa"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "mnist_scale_analysis.plot_seeds(mnist_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "BM9Dde95Ngwn"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "PweN9CwBIEps"
      },
      "source": [
        "### Catch scale"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xv0kzFFGiLnL"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/catch.png\" alt=\"catch diagram\" height=\"300\"/\u003e\n",
        "\n",
        "\n",
        "DeepMind's internal \"hello world\" for RL agents.\n",
        "\n",
        "- The environment is a 5x10 grid with a single falling block per episode (similar to Tetris).\n",
        "- The agent controls a single \"paddle\" pixel that it should use to \"catch\" the falling block.\n",
        "- If the agent catches the block the reward is +1; if it misses the block the reward is -1.\n",
        "- Run reward_scale = [0.01, 0.1, 1., 10, 100] for 4 seeds for 10k episodes.\n",
        "- Score is the percentage of successful catches over the first 10k episodes.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "C03JJqYkIEpv"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "catch_scale_df = DF[DF.bsuite_env == 'catch_scale'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'catch_scale', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "d_vt5a__IEpz"
      },
      "outputs": [],
      "source": [
        "#@title average regret over learning (lower is better)\n",
        "catch_scale_analysis.plot_average(catch_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "FMf_UvCFupuv"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 10k episodes by reward_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent.\n",
        "- Look for reward_scale with performance significantly better than random agent."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "VSXonpvuIEp5"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "catch_scale_analysis.plot_learning(catch_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "tyE8auVluzqa"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret through learning by reward_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent baseline.\n",
        "- Look for reward_scale with performance significantly better than baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "Q0CtosBw4wwx"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "catch_scale_analysis.plot_seeds(catch_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "DG5P82d-NhoP"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "-Tbhu6tKIEqG"
      },
      "source": [
        "### Mountain car scale"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "0reSPxfYiVNL"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/mountain_car.png\" alt=\"mountaincar diagram\" height=\"300\"/\u003e\n",
        "\n",
        "A classic benchmark problem in RL.\n",
        "The agent controls an underpowered car and must drive it out of a valley.\n",
        "\n",
        "- Reward of -1 each step until the car reaches the goal.\n",
        "- Maximum episode length of 1000 steps.\n",
        "- Run reward_scale = [0.01, 0.1, 1., 10, 100] for 4 seeds for 1k episodes.\n",
        "- Score is based on regret against \"good\" policy that solves in 25 steps.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "xtDJON_4IEqH"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "mountain_car_scale_df = DF[DF.bsuite_env == 'mountain_car_scale'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'mountain_car_scale', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "-EfiYNQhIEqI"
      },
      "outputs": [],
      "source": [
        "#@title average regret over learning (lower is better)\n",
        "mountain_car_scale_analysis.plot_average(mountain_car_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_WtMKupHurkM"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 1k episodes by reward_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent.\n",
        "- Look for reward_scale with performance significantly better than random agent."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "bZMxcgQkIEqL"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "mountain_car_scale_analysis.plot_learning(mountain_car_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ZfIb7ZNnu1UK"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret through learning by reward_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent baseline.\n",
        "- Look for reward_scale with performance significantly better than baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "HzzdOtA_4yXu"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "mountain_car_scale_analysis.plot_seeds(mountain_car_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "NcQJaFUBNjQe"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "USfDNwCtIEp9"
      },
      "source": [
        "### Cartpole scale"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "F0LcbVm3iSM6"
      },
      "source": [
        "\n",
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/cartpole.png\" alt=\"cartpole diagram\" height=\"300\"/\u003e\n",
        "\n",
        "A classic benchmark problem in RL.\n",
        "The agent controls a cart on a frictionless plane.\n",
        "\n",
        "- The pole starts near upright.\n",
        "- The observation is [x, x_dot, sin(theta), sin(theta)_dot, cos(theta), cos(theta)_dot, time_elapsed]\n",
        "- Episodes end once 1000 steps have occurred, or |x| is greater than 1.\n",
        "- Reward of +1 when pole \u003e 0.8 height.\n",
        "- Run reward_scale = [0.01, 0.1, 1., 10, 100] for 4 seeds for 1k episodes.\n",
        "- Score is percentage of timesteps balancing the pole.\n",
        "- Must log `episode`, `total_regret` for standard analysis.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "mfSO4Q4gIEp-"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "cartpole_scale_df = DF[DF.bsuite_env == 'cartpole_scale'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'cartpole_scale', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "gdZXo0QwIEqB"
      },
      "outputs": [],
      "source": [
        "#@title average regret over learning (lower is better)\n",
        "cartpole_scale_analysis.plot_average(cartpole_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "rbV2q1snuqdy"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 10k episodes by reward_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agents.\n",
        "- Look for reward_scale with performance significantly better than random agent."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "eFZ2_koZIEqE"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "cartpole_scale_analysis.plot_learning(cartpole_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TOBg2c5Ku0Xq"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 10k episodes by reward_scale (lower is better)\n",
        "- Dashed line shows the performance of a random agent baseline.\n",
        "- Look for reward_scale with performance significantly better than baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "fyTcpBud4xlP"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "cartpole_scale_analysis.plot_seeds(cartpole_scale_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "nawk_01lNifm"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "tV8NnR1pJIkN"
      },
      "source": [
        "## Exploration\n",
        "\n",
        "Exploration is the problem of prioritizing useful information for learning."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "NMY_PV_PJWvy"
      },
      "source": [
        "### Deep sea\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "2G2dMRuhJWvz"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/deep_sea.png\" alt=\"deep sea diagram\" height=\"300\"/\u003e\n",
        "\n",
        "Scalable chain domains that test for\n",
        "[deep exploration](https://arxiv.org/abs/1703.07608).\n",
        "\n",
        "The environment is an N x N grid with falling blocks similar to catch. However\n",
        "the block always starts in the top left. In each timestep, the agent can move\n",
        "the block \"left\" or \"right\". At each timestep, there is a small cost for moving\n",
        "\"right\" and no cost for moving \"left\". However, the agent can receive a large\n",
        "reward for choosing \"right\" N-times in a row and reaching the bottom right. This\n",
        "is the single rewarding policy, all other policies receive zero or negative\n",
        "return making this a very difficult exploration problem.\n",
        "\n",
        "-   Run deep_sea sizes N=5,6,7,..,50 for at least 10k episodes.\n",
        "-   Score is the percentage of N for which average regret \u003c 0.9 faster than 2^N.\n",
        "-   Must log `episode`, `total_return` for standard analysis."
      ]
    },
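    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "As a rough illustration of the scoring rule above, the sketch below counts the sizes N that reach average regret \u003c 0.9 in fewer than 2^N episodes. The `episodes_to_solve` values are made up for illustration and the helper name is hypothetical; this is one reading of the rule, not the `deep_sea_analysis` implementation.\n",
        "\n",
        "```python\n",
        "# Hypothetical results: size N -\u003e first episode with average regret \u003c 0.9,\n",
        "# or None if the threshold was never reached. Not real bsuite output.\n",
        "episodes_to_solve = {5: 20, 10: 400, 15: 3000, 20: None, 25: None}\n",
        "\n",
        "\n",
        "def solved_faster_than_exponential(size, episode):\n",
        "  # A size counts as solved only if the threshold was reached before 2^N episodes.\n",
        "  return episode is not None and episode \u003c 2 ** size\n",
        "\n",
        "\n",
        "solved = [solved_faster_than_exponential(n, e)\n",
        "          for n, e in episodes_to_solve.items()]\n",
        "score = sum(solved) / len(solved)\n",
        "print(f'score: {score:.0%}')\n",
        "```\n",
        "\n",
        "With these made-up numbers, three of the five sizes count as solved, so the sketch prints a score of 60%."
      ]
    },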
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "BIAMOfnzJWv0"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "deep_sea_df = DF[DF.bsuite_env == 'deep_sea'].copy()\n",
        "deep_sea_plt = deep_sea_analysis.find_solution(deep_sea_df, SWEEP_VARS)\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'deep_sea', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "0Scr29pYJWv3"
      },
      "outputs": [],
      "source": [
        "#@title average regret by size through learning (lower is better)\n",
        "deep_sea_analysis.plot_regret(deep_sea_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "JNt8bTBkJWv9"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Learning curves of average regret through time (lower is better).\n",
        "- Dashed line shows the performance of suboptimal \"greedy\" algorithm\n",
        "- Look for largest size with performance significantly better than greedy agent.\n",
        "- Curves also show dynamics through time."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "aGZnDiV_JWv-"
      },
      "outputs": [],
      "source": [
        "#@title scaling of learning time with deep_sea size (lower + more blue is better)\n",
        "deep_sea_analysis.plot_scaling(deep_sea_plt, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "XRTs5_yxJWwC"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Compute the number of episodes until the average regret \u003c 0.9 for each problem size.\n",
        "- Red dots have *not* solved the problem, but have simply performed only that many episodes.\n",
        "- Dashed line shows curve 2^N, which is the scaling we expect for agents without deep exploration.\n",
        "- Want to see consistent curve of blue dots signficantly *below* the dashed line -\u003e deep exploration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "L5SYqq4IJWwD"
      },
      "outputs": [],
      "source": [
        "#@title scaling of learning time with deep_sea size on log scale (lower + more blue is better)\n",
        "deep_sea_analysis.plot_scaling_log(deep_sea_plt, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "mCMKo6h0JWwG"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Plots exactly the same data as above, but on a logarithmic scale.\n",
        "- If we see polynomial scaling -\u003e this should result in a linear relationship between log(learning time) and log(size).\n",
        "- Want to see consistent line of blue dots significantly below the dashed line -\u003e deep exploration."
      ]
    },
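    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "To make the log-log reading above concrete, here is a minimal sketch with synthetic learning times (assumed values, not bsuite output). A polynomial learning time is fit almost exactly by a straight line in log-log space, while the exponential 2^N curve is not. The helper `loglog_fit` is hypothetical and uses only `numpy.polyfit`.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "# Synthetic learning times (episodes to solve); assumed for illustration only.\n",
        "sizes = np.arange(5, 51, 5)\n",
        "poly_time = 10.0 * sizes ** 3  # polynomial scaling\n",
        "expo_time = 2.0 ** sizes       # exponential scaling, like the dashed 2^N line\n",
        "\n",
        "\n",
        "def loglog_fit(n, t):\n",
        "  # Fit log(t) = slope * log(n) + intercept and report the worst-case residual.\n",
        "  x, y = np.log(n), np.log(t)\n",
        "  slope, intercept = np.polyfit(x, y, deg=1)\n",
        "  residual = np.max(np.abs(y - (slope * x + intercept)))\n",
        "  return slope, residual\n",
        "\n",
        "\n",
        "poly_slope, poly_resid = loglog_fit(sizes, poly_time)\n",
        "expo_slope, expo_resid = loglog_fit(sizes, expo_time)\n",
        "print(f'polynomial: slope={poly_slope:.2f}, max residual={poly_resid:.3f}')\n",
        "print(f'exponential: slope={expo_slope:.2f}, max residual={expo_resid:.3f}')\n",
        "```\n",
        "\n",
        "The polynomial series gives a slope close to its exponent (3 here) with a residual near zero, whereas the exponential series leaves a clearly non-zero residual because it curves upward on log-log axes."
      ]
    },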
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "DtoFkEw9IzjO"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "deep_sea_analysis.plot_seeds(deep_sea_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "2kHPsxo-NkUP"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Fada-WLrKDdA"
      },
      "source": [
        "### Stochastic deep sea\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "80Ih3cX4KDdD"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/deep_sea.png\" alt=\"deep sea stochastic diagram\" height=\"300\"/\u003e\n",
        "\n",
        "Scalable chain domains that test for\n",
        "[deep exploration](https://arxiv.org/abs/1703.07608).\n",
        "\n",
        "The environment is an N x N grid with falling blocks similar to catch. However\n",
        "the block always starts in the top left. In each timestep, the agent can move\n",
        "the block \"left\" or \"right\". At each timestep, there is a small cost for moving\n",
        "\"right\" and no cost for moving \"left\". However, the agent can receive a large\n",
        "reward for choosing \"right\" N-times in a row and reaching the bottom right. This\n",
        "is the single rewarding policy, all other policies receive zero or negative\n",
        "return making this a very difficult exploration problem.\n",
        "\n",
        "The stochastic version of this domain only transitions to the right with\n",
        "probability (1 - 1/N) and adds N(0,1) noise to the 'end' states of the chain.\n",
        "\n",
        "-   Run deep_sea sizes N=5,6,7,..,50 for at least 10k episodes.\n",
        "-   Score is the percentage of N for which average regret \u003c 0.9 faster than 2^N.\n",
        "-   Must log `episode`, `total_return` for standard analysis."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "1qJ96InzKDdE"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "deep_sea_stochastic_df = DF[DF.bsuite_env == 'deep_sea_stochastic'].copy()\n",
        "deep_sea_stochastic_plt = deep_sea_stochastic_analysis.find_solution(deep_sea_stochastic_df, SWEEP_VARS)\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'deep_sea_stochastic', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "f1vKIKoMKDdH"
      },
      "outputs": [],
      "source": [
        "#@title average regret by size through learning (lower is better)\n",
        "deep_sea_stochastic_analysis.plot_regret(deep_sea_stochastic_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Sr0evA1DKDdN"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Learning curves of average regret through time (lower is better).\n",
        "- Dashed line shows the performance of suboptimal \"greedy\" algorithm\n",
        "- Look for largest size with performance significantly better than greedy agent.\n",
        "- Curves also show dynamics through time."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "T1oUdaXnKDdO"
      },
      "outputs": [],
      "source": [
        "#@title scaling of learning time with deep_sea_stochastic size (lower + more blue is better)\n",
        "deep_sea_stochastic_analysis.plot_scaling(deep_sea_stochastic_plt, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "aIcze9ScKDdR"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Compute the number of episodes until the average regret \u003c 0.9 for each problem size.\n",
        "- Red dots have *not* solved the problem, but have simply performed only that many episodes.\n",
        "- Dashed line shows curve 2^N, which is the scaling we expect for agents without deep exploration.\n",
        "- Want to see consistent curve of blue dots signficantly *below* the dashed line -\u003e deep exploration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "syBJ_aGmKDdS"
      },
      "outputs": [],
      "source": [
        "#@title scaling of learning time with deep_sea size on log scale (lower + more blue is better)\n",
        "deep_sea_stochastic_analysis.plot_scaling_log(deep_sea_stochastic_plt, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "rQF9BDzoKDdY"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Plots exactly the same data as above, but on a logarithmic scale.\n",
        "- If we see polynomial scaling -\u003e this should result in a linear relationship between log(learning time) and log(size).\n",
        "- Want to see consistent line of blue dots significantly below the dashed line -\u003e deep exploration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "s_bWpZ5UJrwJ"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "deep_sea_stochastic_analysis.plot_seeds(deep_sea_stochastic_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "DhGbNwJfNl1m"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "g_mroLiVK1RE"
      },
      "source": [
        "### Cartpole swingup\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "S2KWDe9dK1RH"
      },
      "source": [
        "\n",
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/cartpole.png\" alt=\"cartpole diagram\" height=\"300\"/\u003e\n",
        "\n",
        "A difficult cartpole swingup task with sparse rewards and a cost for moving.\n",
        "This domain is somewhat similar to \"deep sea\" but cannot be solved easily by tabular reinforcement learning algorithms.\n",
        "\n",
        "- The observation is `[x, cos_theta, sin_theta, x_dot, theta_dot, x_central]`\n",
        "- The dynamics are given by the classic cartpole from dm [control suite](https://github.com/deepmind/dm_control/blob/master/all_domains.png)\n",
        "- Each episode begins with the pole hanging downwards and ends after 1000 timesteps.\n",
        "- There is a small cost of -0.1 for any movement of the pole.\n",
        "- There is a reward of +1 only if:\n",
        "  - x_dot, theta_dot \u003c 1\n",
        "  - pole_height \u003e 1 - `difficulty_scale`\n",
        "  - x \u003c 1 - `difficulty_scale`\n",
        "\n",
        "The parameter `difficulty_scale` acts as a scaling for the depth of exploration, similar to the \"size\" in deep sea.\n",
        "To run this experiment:\n",
        "\n",
        "- Run the agent on difficulty_scale = 0, 0.05, 0.1, .. , 0.95 for 1k episodes\n",
        "- Score is proportion of runs that achieve an average_return \u003e 0 at any point.\n",
        "- Must log `episode`, `total_return` for standard analysis\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "odVWa8S5K1RJ"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "cartpole_swingup_df = DF[DF.bsuite_env == 'cartpole_swingup'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'cartpole_swingup', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "Gc4VSuUsK1RO"
      },
      "outputs": [],
      "source": [
        "#@title scaling with difficulty scale (higher + more blue is better)\n",
        "cartpole_swingup_analysis.plot_scale(cartpole_swingup_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "X4HWtEcIK1RT"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- For each height threshold, look at the best observed return.\n",
        "- If the observed return is greater than 500 ==\u003e the pole was swung upright and balanced for at least 5 seconds.\n",
        "- Look for higher scores and more blue."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "r5HeI4rrK1RV"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "cartpole_swingup_analysis.plot_learning(cartpole_swingup_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Uta3sgNOK1RY"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Learning curves of average return through time (higher is better).\n",
        "- Dashed line shows the performance of an agent that does not move = 0.\n",
        "- Look for largest difficulty_scale with performance significantly better than staying still."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "M3tiBC9442n1"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "cartpole_swingup_analysis.plot_seeds(cartpole_swingup_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "FAlAYE7oNms2"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Jpj7JjESSs_J"
      },
      "source": [
        "## Credit assignment\n",
        "\n",
        "This is a collection of domains for credit assignment."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "k4S-Q5B5Sysn"
      },
      "source": [
        "### Umbrella length"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "swsqn6tXSysr"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/umbrella.png\" alt=\"umbrella diagram\" height=\"300\"/\u003e\n",
        "\n",
        "A stylized problem designed to highlight problems to do with temporal credit assignment and scaling with time horizon.\n",
        "\n",
        "- The state observation is [need_umbrella, have_umbrella, time_to_go,] + n \"distractor\" features that are iid Bernoulli.\n",
        "- At the start of each episode the agent observes if it will need an umbrella.\n",
        "- It then has the chance to pick up an umbrella only in the first timestep.\n",
        "- At the end of the episode the agent receives a reward of +1 if it made the correct choice of umbrella, but -1 if it made the incorrect choice.\n",
        "- During chain_length intermediate steps rewards are random +1 or -1.\n",
        "\n",
        "The experiment setup:\n",
        "- Run umbrella_chain with n_distractor=20 and sweep chain_length=1..100 logarithmically spaced for 10k episodes.\n",
        "- Score is percent of tasks with average reward per episode \u003e 0.5.\n",
        "- Must log `episode`, `total_return`, `total_regret` for standard analysis."
      ]
    },
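    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The random-agent regret of 1.0 used as a baseline for this domain can be checked with a small simulation. This is a sketch under stated assumptions, not the bsuite environment: it assumes the umbrella is needed with probability 0.5, that the intermediate +1/-1 rewards are equally likely, and that regret is measured against an agent that always makes the correct choice. The value of `chain_length` is an arbitrary example.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "rng = np.random.default_rng(0)\n",
        "n_episodes = 100_000\n",
        "chain_length = 10  # assumed example value\n",
        "\n",
        "# Final reward: +1 for the correct umbrella choice, -1 otherwise.\n",
        "need_umbrella = rng.integers(0, 2, size=n_episodes)\n",
        "random_choice = rng.integers(0, 2, size=n_episodes)\n",
        "final_reward = np.where(random_choice == need_umbrella, 1.0, -1.0)\n",
        "\n",
        "# Intermediate rewards are random +1 or -1, whatever the agent does.\n",
        "noise = rng.choice([-1.0, 1.0], size=(n_episodes, chain_length)).sum(axis=1)\n",
        "\n",
        "random_return = final_reward + noise\n",
        "optimal_return = 1.0 + noise  # same noise, always the correct choice\n",
        "regret = np.mean(optimal_return - random_return)\n",
        "print(f'average regret of a random agent: {regret:.3f}')\n",
        "```\n",
        "\n",
        "The intermediate noise cancels in the difference, so the estimate depends only on the final choice and should come out close to 1.0 for any `chain_length`."
      ]
    },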
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "H22CVW-gSyss"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "umbrella_length_df = DF[DF.bsuite_env == 'umbrella_length'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'umbrella_length', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "YW9UOQTRSysw"
      },
      "outputs": [],
      "source": [
        "#@title average regret after 10k episodes (lower is better)\n",
        "umbrella_length_analysis.plot_scale(umbrella_length_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "besZfRI7Sys1"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Compute the average regret after 10k episodes for each chain_length problem scale.\n",
        "- Red dots have *not* solved the problem, blue dots made significant progress (average regret \u003c 0.5)\n",
        "- Dashed line shows regret of a random agent = 1.0.\n",
        "- We want to see lots of blue dots with low regret for large chain_length."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "3STrnTWrSys2"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "umbrella_length_analysis.plot_learning(umbrella_length_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "InmXA_iKSys5"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Learning curves of average regret through time (lower is better).\n",
        "- Dashed line shows the performance of a random agents (regret = 1.0)\n",
        "- Look for largest chain_length with performance significantly better than random agent.\n",
        "- Curves also show dynamics through time.\n",
        "- Smoothing is performed with rolling mean over 10% of data with confidence bar at 95% Gaussian standard error."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "O1pSwPi-430_"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "umbrella_length_analysis.plot_seeds(umbrella_length_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "u8GX_ZBxNnXe"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "kDKk7PhyTEif"
      },
      "source": [
        "### Umbrella distract"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "2SqqFSKaTEii"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/umbrella.png\" alt=\"umbrella diagram\" height=\"300\"/\u003e\n",
        "\n",
        "\n",
        "A stylized problem designed to highlight problems to do with temporal credit assignment and scaling with time horizon.\n",
        "\n",
        "- The state observation is [need_umbrella, have_umbrella, time_to_go,] + n \"distractor\" features that are iid Bernoulli.\n",
        "- At the start of each episode the agent observes if it will need an umbrella.\n",
        "- It then has the chance to pick up an umbrella only in the first timestep.\n",
        "- At the end of the episode the agent receives a reward of +1 if it made the correct choice of umbrella, but -1 if it made the incorrect choice.\n",
        "- During chain_length intermediate steps rewards are random +1 or -1.\n",
        "\n",
        "The experiment setup:\n",
        "- Run umbrella_chain with n_distractor=20 and sweep chain_length=1..100 logarithmically spaced for 10k episodes.\n",
        "- Score is percent of tasks with average reward per episode \u003e 0.5.\n",
        "- Must log `episode`, `total_return`, `total_regret` for standard analysis."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "fWP2zwM8TEij"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "umbrella_distract_df = DF[DF.bsuite_env == 'umbrella_distract'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'umbrella_distract', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "IiVxaDffTEim"
      },
      "outputs": [],
      "source": [
        "#@title average regret after 10k episodes (lower is better)\n",
        "umbrella_distract_analysis.plot_scale(umbrella_distract_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_xYIXBJwTEiq"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Compute the average regret after 10k episodes for each chain_length problem scale.\n",
        "- Red dots have *not* solved the problem, blue dots made significant progress (average regret \u003c 0.5)\n",
        "- Dashed line shows regret of a random agent = 1.0.\n",
        "- We want to see lots of blue dots with low regret for large chain_length."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "XZSfnKjcTEir"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "umbrella_distract_analysis.plot_learning(umbrella_distract_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "05vEFGn7TEiv"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Learning curves of average regret through time (lower is better).\n",
        "- Dashed line shows the performance of a random agents (regret = 1.0)\n",
        "- Look for largest chain_length with performance significantly better than random agent.\n",
        "- Curves also show dynamics through time.\n",
        "- Smoothing is performed with rolling mean over 10% of data with confidence bar at 95% Gaussian standard error."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "IFvHvi5H47yH"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "umbrella_distract_analysis.plot_seeds(umbrella_distract_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TCnnE435NoHv"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "5_NxfUeUUTCz"
      },
      "source": [
        "### Discounting chain"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "O30h3qfmUTC3"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/discounting_chain.png\" alt=\"discount diagram\" height=\"300\"/\u003e\n",
        "\n",
        "A stylized problem designed to highlight an agent's ability to correctly maximize cumulative rewards without discounting bias.\n",
        "- The only decision that actually matters is the agent's *first* of the episode, after which the agent is locked into a \"chain\" irrespective of actions.\n",
        "- Each chain gives a non-zero reward only at one step of the length-100 episode: [1, 3, 10, 30, 100] steps.\n",
        "- Each chain gives a reward of +1, except for the optimal_horizon, which gives a reward of +1.1\n",
        "- Many agents with discounting will struggle to maximize cumulative returns.\n",
        "\n",
        "The experiment setup:\n",
        "- Run each optimal_horizon [1, 3, 10, 30, 100], each with 5 seeds for 1k episodes.\n",
        "- Score is average regret * 10.\n",
        "- Must log `episode`, `total_return` for standard analysis"
      ]
    },
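    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "To see why discounting can mislead an agent here, the sketch below compares the discounted value of each chain with its undiscounted return. The chain rewards follow the description above; the discount factor 0.99, the choice of `optimal_horizon = 100`, and the convention that a reward at step h is discounted h - 1 times are assumptions made for illustration.\n",
        "\n",
        "```python\n",
        "# Chain rewards as described above; the discount factor is an assumed example.\n",
        "horizons = [1, 3, 10, 30, 100]\n",
        "optimal_horizon = 100\n",
        "gamma = 0.99\n",
        "\n",
        "\n",
        "def discounted_value(horizon, reward, discount):\n",
        "  # The reward arrives at step `horizon`; assume it is discounted horizon - 1 times.\n",
        "  return reward * discount ** (horizon - 1)\n",
        "\n",
        "\n",
        "values = {}\n",
        "for h in horizons:\n",
        "  reward = 1.1 if h == optimal_horizon else 1.0\n",
        "  values[h] = discounted_value(h, reward, gamma)\n",
        "\n",
        "best_discounted = max(values, key=values.get)\n",
        "print(values)\n",
        "print('chain preferred under discounting:', best_discounted)\n",
        "print('chain with the highest undiscounted return:', optimal_horizon)\n",
        "```\n",
        "\n",
        "Under these assumptions the discounted value of the 100-step chain falls well below 1, so a value-greedy agent prefers the 1-step chain even though the 100-step chain pays more in total."
      ]
    },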
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "GpSud5k1UTC4"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "discounting_chain_df = DF[DF.bsuite_env == 'discounting_chain'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'discounting_chain', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "l-Qyf4CqUTC8"
      },
      "outputs": [],
      "source": [
        "#@title average regret after 1k episodes (lower is better)\n",
        "discounting_chain_analysis.plot_average(discounting_chain_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "YaBqIHXQUTDD"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Display the average regret after 10k episodes by optimal_horizon (lower is better)\n",
        "- Dashed line shows the performance of a random agents (regret = 0.8)\n",
        "- Look for largest horizon with performance significantly better than random agent."
      ]
    },
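    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "As a sanity check on the dashed line, here is a back-of-the-envelope calculation. It assumes the random agent picks its first action uniformly among the five chains, which is an assumption about the baseline rather than something stated in this notebook:\n",
        "\n",
        "$$\\mathbb{E}[\\text{return}] = \\frac{4 \\times 1 + 1.1}{5} = 1.02$$\n",
        "\n",
        "$$\\text{regret per episode} = 1.1 - 1.02 = 0.08$$\n",
        "\n",
        "This matches the quoted value of 0.8 if the plotted regret carries the same factor of 10 that the experiment setup applies to the score."
      ]
    },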
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "6JQbVV9dUTDE"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "discounting_chain_analysis.plot_learning(discounting_chain_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "QLri9ZsjUTDJ"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Learning curves of average regret through time (lower is better).\n",
        "- Dashed line shows the performance of a random agents (regret = 0.8)\n",
        "- Look for largest horizon with performance significantly better than random agent.\n",
        "- Curves also show dynamics through time.\n",
        "- Smoothing is performed with rolling mean over 10% of data with confidence bar at 95% Gaussian standard error."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "11uMqmE9484R"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "discounting_chain_analysis.plot_seeds(discounting_chain_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "GEhg1QutNpbb"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "3tACBZKzTfNS"
      },
      "source": [
        "## Memory\n",
        "\n",
        "A collection of experiments designed to test memory capabilities."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "F1i-6W76Tiba"
      },
      "source": [
        "### Memory length"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Ajm0qATFTibe"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/memory_chain.png\" alt=\"memory diagram\" height=\"300\"/\u003e\n",
        "\n",
        "\n",
        "A stylized [T-maze](https://en.wikipedia.org/wiki/T-maze) problem designed to highlight an agent's ability to remember important information and use it to make good decisions.\n",
        "- At the beginning of the episode the agent is provided a context of +1 or -1.\n",
        "- At all future timesteps the context is equal to zero and a countdown until the end of the episode.\n",
        "- At the end of the episode the agent must select the correct action corresponding to the context to reward +1 or -1.\n",
        "\n",
        "The experiment setup:\n",
        "- Run memory sizes 1..100 logarithmically spaced.\n",
        "- Score is proportion of memory sizes with average regret \u003c 0.5.\n",
        "- Must log `episode`, `total_return` for standard analysis"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "8cMpQ1AHTibf"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "memory_len_df = DF[DF.bsuite_env == 'memory_len'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'memory_len', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "RH8ko4YFTibi"
      },
      "outputs": [],
      "source": [
        "#@title memory scaling (lower + more blue is better)\n",
        "memory_len_analysis.plot_scale(memory_len_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "aqTarGL-Tibm"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Compute the average regret after 10k episodes for each memory_length problem scale.\n",
        "- Red dots have *not* solved the problem, blue dots made significant progress (average regret \u003c 0.5)\n",
        "- Dashed line shows regret of a random agent = 1.0.\n",
        "- We want to see lots of blue dots with low regret for large memory_length."
      ]
    },
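    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The two reference values in this plot can be related to the agent's accuracy with a short calculation. It assumes the final decision is between two actions, rewarded +1 when correct and -1 otherwise, and that regret is measured against an agent that is always correct; $p$ (the probability of choosing the correct action) is a symbol introduced here for illustration:\n",
        "\n",
        "$$\\mathbb{E}[\\text{return}] = p \\cdot (+1) + (1 - p) \\cdot (-1) = 2p - 1$$\n",
        "\n",
        "$$\\text{regret} = 1 - (2p - 1) = 2\\,(1 - p)$$\n",
        "\n",
        "A uniformly random agent has $p = 0.5$, giving regret $1.0$ (the dashed line). The threshold of average regret $\u003c 0.5$ corresponds to $p \u003e 0.75$ under these assumptions."
      ]
    },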
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "Gvee-SBNTibn"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "memory_len_analysis.plot_learning(memory_len_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "r3f55zyLTibs"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Learning curves of average regret through time (lower is better).\n",
        "- Dashed line shows the performance of a random agents (regret = 1.0)\n",
        "- Look for largest memory_length with performance significantly better than random agent.\n",
        "- Curves also show dynamics through time.\n",
        "- Smoothing is performed with rolling mean over 10% of data with confidence bar at 95% Gaussian standard error."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "FTxbDm_G8-O9"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "memory_len_analysis.plot_seeds(memory_len_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "FPginvBeNqju"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "eWua8ocyT5eE"
      },
      "source": [
        "### Memory size)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "n5EmxcwJT5eK"
      },
      "source": [
        "\u003cimg src=\"https://storage.cloud.google.com/bsuite-colab-images/memory_chain.png\" alt=\"memory diagram\" height=\"300\"/\u003e\n",
        "\n",
        "A stylized [T-maze](https://en.wikipedia.org/wiki/T-maze) problem designed to highlight an agent's ability to remember important information and use it to make good decisions.\n",
        "- At the beginning of an episode the agent is provided an N bit context vector.\n",
        "- After a couple of steps the agent is provided a query as an integer number between `0` and `num_bits-1` and must select the correct action corresponding to `context[query]`.\n",
        "\n",
        "The experiment setup:\n",
        "- Run memory sizes 1..100 logarithmically spaced.\n",
        "- Score is proportion of memory sizes with average regret \u003c 0.5.\n",
        "- Must log `episode`, `total_return` for standard analysis"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "AVTgmQufT5eM"
      },
      "outputs": [],
      "source": [
        "#@title parsing data\n",
        "memory_size_df = DF[DF.bsuite_env == 'memory_size'].copy()\n",
        "summary_analysis.plot_single_experiment(BSUITE_SCORE, 'memory_size', SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "6-ebdsGrT5eP"
      },
      "outputs": [],
      "source": [
        "#@title memory scaling (lower + more blue is better)\n",
        "memory_size_analysis.plot_scale(memory_size_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "la_nvowET5eU"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Compute the average regret after 10k episodes for each memory_sizegth problem scale.\n",
        "- Red dots have *not* solved the problem, blue dots made significant progress (average regret \u003c 0.5)\n",
        "- Dashed line shows regret of a random agent = 1.0.\n",
        "- We want to see lots of blue dots with low regret for large memory_sizegth."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "BbdlXbUGT5eV"
      },
      "outputs": [],
      "source": [
        "#@title average regret through learning (lower is better)\n",
        "memory_size_analysis.plot_learning(memory_size_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "FIIFRKiBT5ea"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "- Learning curves of average regret through time (lower is better).\n",
        "- Dashed line shows the performance of a random agents (regret = 1.0)\n",
        "- Look for largest memory_length with performance significantly better than random agent.\n",
        "- Curves also show dynamics through time.\n",
        "- Smoothing is performed with rolling mean over 10% of data with confidence bar at 95% Gaussian standard error."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "-1qX20c_4_JW"
      },
      "outputs": [],
      "source": [
        "#@title plot performance by seed (higher is better)\n",
        "memory_size_analysis.plot_seeds(memory_size_df, SWEEP_VARS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fNxUm-A2Nx42"
      },
      "source": [
        "**Parsing the plot above:**\n",
        "\n",
        "- Here we can see the performance of each agent individually through time.\n",
        "- Higher scores are better, but individual runs may be noisy.\n",
        "- Use this plot to diagnose strange agent behaviour."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "KWdGnbSPIbmd"
      },
      "source": [
        "## Exporting as PDF\n",
        "\n",
        "- Run all colab cells above in `Colaboratory`\n",
        "- Run the cell below to download a compressed `images.zip`\n",
        "- Copy `images/` in `bsuite/reports/images`\n",
        "- Run `bsuite/reports/bsuite_report.tex` to generate a summary pdf report"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "VwAKOdzHfmAr"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from google.colab import files\n",
        "\n",
        "# Save images required for the reports in an `images/` folder.\n",
        "if not os.path.exists('images'):\n",
        "  os.makedirs('images')\n",
        "\n",
        "__radar_fig__.savefig('images/radar_plot.png', bbox_inches=\"tight\")\n",
        "\n",
        "# Compress folder and download\n",
        "!zip -r /images.zip /content/images\n",
        "files.download(\"images.zip\") "
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [
        "ds789Mrq5LmR",
        "dwIcX62dDnNE",
        "vQmNzVbBDqZa",
        "_ypLP6DZHZc8",
        "GrTjfY11MD5E",
        "YtCu7IUwFYOY",
        "iKRx2R7DEz5R",
        "UQ010l9tFsbG",
        "SWm2u8lpFsbK",
        "XeeO3UdkHvro",
        "BhNvrDHtFsbW",
        "PvkWAhKAFsbo",
        "_OXjiYVTFsbe",
        "zCNVq9M0IEpT",
        "U5B77UDjIEpY",
        "hPlIUnPgIBb5",
        "PweN9CwBIEps",
        "-Tbhu6tKIEqG",
        "USfDNwCtIEp9",
        "tV8NnR1pJIkN",
        "NMY_PV_PJWvy",
        "Fada-WLrKDdA",
        "g_mroLiVK1RE",
        "Jpj7JjESSs_J",
        "k4S-Q5B5Sysn",
        "kDKk7PhyTEif",
        "5_NxfUeUUTCz",
        "3tACBZKzTfNS",
        "F1i-6W76Tiba",
        "eWua8ocyT5eE"
      ],
      "name": "results.ipynb",
      "provenance": [],
      "toc_visible": true,
      "version": "0.3.2"
    },
    "kernelspec": {
      "display_name": "Python 2",
      "name": "python2"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
